Nowadays, we are stepping into an era of information explosion. Social media received much more attention than traditional media because it creates a convenient environment for people from different places to share information and make connections. Twitter is one of the most popular social media platforms with 237.8 million monetizable daily active users. Every second, there are countless posts on Twitter expressing opinions and emotions about a variety of topics. Twitter is featured by using hashtags, a combination of a # symbol and index keywords, to help its users easily find and follow the topics they are interested in. Another distinguished feature of Twitter is Twitter Trends, which detects the most popular always-changing trending topics in different countries. Twitter also provides an opportunity for users to retrieve tweets through Twitter API without opening the application. It involves counting methods and various machine learning algorithms to identify trending topics on Twitter through extracting targeted key words (Rodrigues et al. 2021).
The biggest search engine Google also provides a free tool, Google Trends, for analyzing the popularity of Google search terms using real-time data. Similar to Twitter API, Google Trends allows users to capture what people are searching in different periods of time, season, and location. The main purpose of Google Trends is providing users with key insights about the volume of Google searches related to specific search terms, as well as showing the relative popularity of these search queries based on geographical locations.
Twitter API and Google Trends enable us to easily extract the specific information and get deeper into the topics we care about. When we started thinking about what topic we are going to analyze in this project, we noticed that despite the regular explore tabs “News”, “Sports”, and “Entertainment”, there is a new tab “World Cup” at the top of the Twitter homepage because of the ongoing 2022 Qatar World Cup, which is the most viewed and followed sporting event in the world and now dominating online discussions, especially in Twitter. Elon Musk revealed that Twitter traffic related to the 2022 Qatar World Cup almost hit 20,000 tweets per second. It aroused our interest to analyze the sentiment related to the topic “World Cup” and how people from worldwide or a specific region respond to it. When users click the tab “World Cup”, they can see several featured Tweets and accounts as they scroll down the page. However, unlike other tabs, there’s no trending hashtags or topics available. Thus, we tracked Tweets before and after each quarter-final match both around the world and within the U.S.. By comparing the trends in the two areas, we can see how people in different areas react differently to the World Cup. The link to this project’s github repo is https://github.com/EEElisa/WorldCupAnalysis.git.
In this project, the data sources are Twitter and Google Trends. As for Twitter, we gathered and cached Tweets at several time points during the quarter-finals of the World Cup (especially before and after each match) with search scope restricted to the U.S. and the whole world. As for Google Trends, we accessed the search volumes of “World Cup” with the same two search scopes, U.S. and the world.
To get an overall trend of the World Cup, we use “gtrendsR” package to access the search volumes of the query “World Cup” from the very beginning of the game season to the end of the quarter-finals, i.e., 2022-11-20 to 2022-12-10, in both the worldwide scope and the United States.
searchtime = "2022-11-20 2022-12-10"
worldwide_trend <- gtrends(c("World Cup"), time = searchtime, low_search_volume = TRUE)
us_worldcup <- gtrends(c("World Cup"), geo = "US", time = searchtime, low_search_volume = TRUE)
world_country_hit <- worldwide_trend$interest_by_country
us_regions_hit <- us_worldcup$interest_by_region
The function “gtrends” returns the search volumes in an 100-point scale over time. The results for the worldwide scope include the search volumes for each country. When we restrict the areas to be the U.S., the results can tell us the search volumes for each state. So we extract the table “interest_by_country” and “interest_by_region” from the two outputs respectively.
The main data source is the Tweets gathered throughout the game season until the end of Quarter-Finals on 2022-12-10. We use the function “search_tweets” from rtweet to cache Tweets before and after each Quarter-Final game using the search query “WorldCup OR worldcup” and save the results into local csv files with the search time in the file name.
By running the following chunk, users will cache 1000 recent Tweets containing the search query “WorldCup OR worldcup” into a local csv file in the folder named “cache”.The file name includes the search time for convenience of future analysis.
cache_path <- paste(getwd(),"/cache/",sep="")
auth_setup_default()
keywords_tweets <- "WorldCup OR worldcup"
usa <- lookup_coords("usa")
current_time <- Sys.time()
us_tweets <- search_tweets(keywords_tweets, geocode=usa, lang="en", n=1000)
us_tweets$searched_at <- current_time
us_tweets <- us_tweets %>%
select(full_text, searched_at)
filename_US <- paste("US-",current_time,".csv",sep='')
path_US <- paste(cache_path,filename_US,sep='')
write.table(us_tweets, path_US, row.names=TRUE)
current_time <- Sys.time()
world_tweets <- search_tweets(keywords_tweets, lang="en",n=1000)
world_tweets$searched_at <- current_time
world_tweets <- world_tweets %>%
select(full_text, searched_at)
filename_world <- paste("World-",current_time,".csv",sep='')
path_world <- paste(cache_path,filename_world,sep='')
write.table(world_tweets, path_world, row.names=TRUE)
We read all the previously cached files into one single data frame for further analysis.
# read cached files
read_cache <- function(file_path){
tweets_total <- data.frame()
n = length(file_path)
for (i in 1:n) {
filepath <- paste(base_path, file_path[i], sep="")
tweets <- read.csv(filepath)
tweets_total <- rbind(tweets_total,tweets)
}
colnames(tweets_total) <- c("value")
return(tweets_total)
}
us_tweets_total <- read_cache(us_files)
n_us <- length(us_tweets_total)
world_tweets_total <- read_cache(world_files)
n_world <- length(world_tweets_total)
We use line chart to visualize the search trends of “World Cup” around the world and within the U.S. to get an overview of the popularity.
The line charts show that the popularity achieved a peak on Nov. 29 when there were heated matches between Ecuador and Senegal, Netherlands and Qatar, Iran and USA, Wales and England. In the end of Group Stage on Dec. 2 and the last day of Round of 16 on Dec. 6, the popularity also increased.
We want to compare the popularity of World Cup among different states of the U.S. via a heat map. What we have so far is the search volumes in each state in an 100-point scale. But one thing to note is that the function “gtrends” doesn’t take population into consideration. Thus, we need to re-scale the search volumes according to the varying size of population.
The package “usa” includes the population estimates on September 26, 2019.
if(!require(usa)) install.packages('usa')
library(usa)
# population data is for 2019
facts <- facts %>%
select(name, population) %>%
arrange(desc(population))
if(!require(scales)) install.packages('scales')
library(scales)
us_regions_hit <- us_worldcup$interest_by_region %>%
rename(state=location, hits=hits) %>%
inner_join(facts, by=c('state'='name'))
# us_regions_hit %>% select("state", "hits","population")
By summing up all the population, we got the total population of the U.S in 2019, which is 327167434. Then, we introduced a population weight factor \(w_{pop} = pop_{state}/pop_{us}\) and multiply the column “hits”, i.e., the search volume, with the corresponding population weight \(w_{pop}\). For convenience of comparison, we rescaled the search volume to 100-point.
We visualized the search volumes among the U.S by a heat map where the darker red color means higher popularity in that area.
Furthermore, we extracted the top 10 regions where the World Cup was
the most popular.
For sentiment analysis, we generated a courpus using the Tweeted collected before. We cleaned the corpus by removing numbers, punctuation, white space, capital letters, stop words, etc. Regarding the stop words, we included the words appearing in the search query such as “worldcup”, “word”, and “cup”. Also, we added other words that were highly related to the World Cup itself rather than sentiments in addition to the basic English stop words given by the function “stopwords(”english”)“. Finally, we passed the preprocessed corpus into funtion”TermDocumentMatrix” to create a term document matrix. For repeated usa, we wrote a function named “preprocess_into_tdm” to generate the term document matrix given Tweets text.
# function to preprocess the Tweets and transform it into words
preprocess_into_tdm <- function(Tweets){
tweets.corpus <- Corpus(VectorSource(Tweets)) %>%
tm_map(removeNumbers) %>% # removes numbers from text
tm_map(removePunctuation) %>% # removes punctuation from text
tm_map(stripWhitespace) %>% # trims the text of whitespace
tm_map(content_transformer(tolower)) %>% # convert text to lowercase
tm_map(removeWords,stopwords) %>% # remove stopwords
tm_map(removeWords,stopwords)# remove stopwords not removed from previous line
tdm <- TermDocumentMatrix(tweets.corpus) %>% # create a term document matrix
as.matrix()
return(tdm)
}
With the term document matrix at hand, we conducted the sentiment analysis then. In this part, we used the function “get_sentiments(”nrc”)” from the package “tidytext” to get a dictionary where 13,872 words were assigned a proper sentiment. We extracted unique words from the term document matrix and counted the occurrences of each word into a tibble. By conducting an inner join between the word table and the nrc sentiment table, the self-defined function “sentiment_analysis(tdm)” will return the table of emotions and their percentage.
# function to conduct sentiment analysis
sentiment_analysis <- function(tdm) {
words <- unique(data.frame(word = names(sort(rowSums(tdm), decreasing = TRUE))))
words <- as_tibble(words)
senti = inner_join(words, get_sentiments("nrc"),by=c("word"="word")) %>%
count(sentiment)
senti$percent = (senti$n/sum(senti$n))*100
return(senti)
}
By comparing the results of two sentiment analysis (using Tweets during the quarter-finals from U.S. and the world), we can see that the overall distribution of sentiments are similar. For instance, “positive” is at top of the rank, followed by “negative” and “trust”, and “sadness”, “surprise” and “disgust” are three least frequent emotions. However, there do exist several differences. The percentage of “joy” in U.S. was higher than the worldwide Tweets while the percentage of “anticipate” in U.S. was slightly lower than the world’s result. The results demonstrates that the distributions of sentiments within the two days are different from each other. Moreover, the analysis can be done on an ongoing basis as the game season progresses.
First, we transform the Tweets that were the user gathered a moment ago into a term document matrix.
tdm <- preprocess_into_tdm(tweets$text)
Then, we built a corpus using all the texts and transformed it into a term document matrix. To plot the word cloud, we calculated two attributes: word and frequency.
words <- sort(rowSums(tdm), decreasing = TRUE) # count all occurences of each word and group them
worldwide_tweets <- data.frame(word = names(words), freq = words) # convert it to a dataframe
head(worldwide_tweets)
In the world cloud, we didn’t extract the emojis as they are straightforward indicators of emotions, which strongly motivates us to do further sentiment analysis of these Tweets. The below plot is an example generated from the Tweets that were collected when users run the rmd file.
Following the above procedure, we visualized the Tweets collected during each quarter-final match using word cloud for both U.S. and the world. As all the plots were generated throughout the game season, we directly inserted them as follows.
These 16 graphs represent how people from worldwide and in the United States react to WorldCup separately through extracting the most frequently mentioned words when they mentioned WorldCup. The bigger the word appears, the more often it was contained in the tweets by users.
For example, in the first row, there are two graphs about the first quarter final: Brazil versus Croatia. The left graph represents how people in the United States respond to WorldCup, we can see that WorldCup mainly links to “brazil”, “croatia”, and “brazilvscroatia”. While the right graph represents worldwide trends related to WorldCup and the popular words were “argentina” and “netherlands”. Other than comparing the searching trends between US and worldwide, through the columns of the graphs, we can see how the overall trend was changing with different matches.
This study aims to use Twitter API and Google Trends to extract trending words and topics with respect to the ongoing 2022 Qatar World Cup, as well as to conduct sentiment analysis. According to the above graphs, we can conclude that there is a discrepancy between US Twitter trends and worldwide trends, which can be supported by multiple perspectives.
By comparing the trending hashtags during the quarter-finals, we can see that the trending hashtags of are relatively lagging behind the match schedule while the ones of U.S. reflect a real-time update. As for the sentiment analysis, despite the overall similar pattern, there are some differences that are worth noting. The percentage of “joy” in U.S. was higher than the worldwide Tweets but the sentiment “anticipate” was slightly less popular than the world’s result. The word cloud is another straightforward display of the most popular words in chosen time points. For example, after Croatia knocked Brazil out of the World Cup, the word cloud shows that the most frequent words appearing in Tweets posted by users in the United States were “brazil ’’ and the emoji of Croatia’s national flag. It indicates that these users followed up-to-date information and posted the tweets in real-time. By contrast, when analyzing the worldwide trend at the same time after Brazil was beaten by Croatia, we surprisingly found that the keywords related to WorldCup were not up-to-date with the ongoing of the match. Instead, the most frequent words were “win”, “earn”, “argentina”, and “netherlands”, which was more related to the earlier matches. In that case, we can conclude that Tweets in the United States followed up-to-date information about World Cup matches while Tweets in the worldwide range seemed to have a delay in information and not about the latest matches.
The limitations of this study mainly come from the data sources. Firstly, the sample size of the U.S. and the world are largely unequal. Even though we took the difference into account and used the percentage rather than absolute values for comparisons, it’s still a significant source of bias. Secondly, the cached Tweets were gathered only during the quarter-finals but the search volumes given by Google Trends indicates that the popularity of World Cup reached the peak at around the end of November. If more data were available for the analysis, the results can contain more insights. Thirdly, the study doesn’t explain the distinctions of sentiment or trending hashtags between two sets of Tweets. Further studies can examine the popular words related to each sentiment. Moreover, the analysis can be generalized to compare the sentiment and trends among different states in U.S. or different countries around the world.